# bad
ggplot(data) + geom_point(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) + theme_minimal() + labs(title = "Age and length of stay of patients at 10 hospital trusts", x = "Patient Age (years)", y = "Patient Length of Stay (Days)")

# better
ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) +
  theme_minimal() +
  labs(
    title = "Age and length of stay of patients at 10 hospital trusts",
    x = "Patient Age (years)",
    y = "Patient Length of Stay (Days)"
  )
Readable Code
If using the tidyverse or ggplot2 then start a new line after each %>% or +
data %>%
  filter(organisation_name == "Trust1") %>%
  ggplot(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) +
  geom_point() +
  theme_minimal()
Use functions to avoid repeating lines of code
plot_function <- function(org_name) {
  age_los_plot <- data %>%
    filter(organisation_name == org_name) %>%
    ggplot(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) +
    geom_point() +
    theme_minimal() +
    labs(x = "Patient Age (Years)", y = "Length of Stay (Days)")

  age_los_plot
}

orgs_list <- list("Trust1", "Trust2", "Trust3")
purrr::map(orgs_list, plot_function)
When naming new variables or functions, choose descriptive names. Do not reuse the same name for different objects.
# bad
model_a <- glm(data$patient_age ~ data$length_of_stay, family = gaussian())
model_b <- glm(as.factor(data$death_flag) ~ data$patient_age, family = binomial())

# better
model_los_age <- glm(data$length_of_stay ~ data$patient_age, family = gaussian())
model_death_age <- glm(as.factor(data$death_flag) ~ data$patient_age, family = binomial())
Naming Things
Use descriptive names for files as well. If you are working on a larger project, consider having a separate file for each stage of the project, and make it clear what order the analysis has been done in.
For example:
01_data_cleaning.R
02_baseline_characteristics.R
03_descriptive_stats.R
04_models.R
05_figures.R
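One benefit of numbered file names is that the whole analysis can be re-run in order with very little code. The sketch below is one possible way to do this, not a required part of the workflow. The function name run_analysis_scripts() and its script_dir argument are made up for illustration, and it assumes the scripts follow the two-digit prefix pattern shown above.

```r
# Hypothetical helper: run every numbered script in `script_dir` in order.
run_analysis_scripts <- function(script_dir = ".") {
  # Match files such as 01_data_cleaning.R (two digits, underscore, name, .R).
  scripts <- list.files(
    script_dir,
    pattern = "^[0-9]{2}_.*\\.R$",
    full.names = TRUE
  )
  # Sorting by name puts 01_ before 02_, and so on.
  scripts <- sort(scripts)

  for (script in scripts) {
    message("Running ", basename(script))
    # By default source() evaluates the script in the global environment,
    # so objects created by one stage are available to the next.
    source(script)
  }

  # Return the names of the scripts that were run, without printing them.
  invisible(basename(scripts))
}

# Usage (from the project directory): run_analysis_scripts()
```

Files that do not start with a two-digit prefix are ignored, so helper scripts or notes can sit in the same folder without being run.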
Organising Your Work
Within an R script, you can use sections to organise your code.
In RStudio, insert a new section using Ctrl + Shift + R, and navigate between sections using the document outline on the right of the script.
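As a rough illustration, a script divided into sections might look like the sketch below. RStudio treats a comment line ending in at least four dashes as a section header and lists it in the document outline. The section names and the simulated data are assumptions made for this example only.

```r
# Set up ------------------------------------------------------------------
set.seed(1)

# Create example data -----------------------------------------------------
# Simulated values for illustration only; this is not real patient data.
data <- data.frame(
  patient_age = sample(18:90, 20, replace = TRUE),
  length_of_stay = sample(1:30, 20, replace = TRUE)
)

# Summarise ---------------------------------------------------------------
mean_length_of_stay <- mean(data$length_of_stay)

# Model -------------------------------------------------------------------
model_los_age <- glm(length_of_stay ~ patient_age, data = data, family = gaussian())
```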
Working within an R Project is a good way to organise your work: it keeps not only your R scripts but also all the data and outputs in the same place.
It also avoids the need to use setwd() at the start of your scripts, which is not best practice, particularly when collaborating with others.
setwd() is typically given an absolute file path, e.g.
setwd("C:/Users/mfbx9sbk/OneDrive - The University of Manchester/MSc Teaching/coding_best_practice_2")
This can cause problems when you are collaborating with others, as not everyone will have their files organised in the same way.
R Projects let you use relative file paths, which are relative to the working directory of the project.
For example, you may want to save a cleaned version of your data, or a plot you have generated.
Here the file paths are relative to the Project directory
ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay))
So if you shared the project with another person, it would not matter where they saved it: all the file paths would still work.
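To make the idea of relative paths concrete, here is a minimal sketch using base R only. The folder name outputs, the file names, and the simulated data are all assumptions for illustration. The paths are resolved against the working directory, which an R Project sets to the project folder when it is opened.

```r
# Simulated data for illustration only; this is not real patient data.
data <- data.frame(
  patient_age = c(34, 58, 71),
  length_of_stay = c(2, 5, 9)
)

# Hypothetical output folder inside the project directory.
output_dir <- "outputs"
dir.create(output_dir, showWarnings = FALSE)

# Save the cleaned data using a relative path (no drive letter or user name).
cleaned_path <- file.path(output_dir, "cleaned_data.csv")
write.csv(data, cleaned_path, row.names = FALSE)

# Save a plot to the same folder, again using a relative path.
plot_path <- file.path(output_dir, "age_los_plot.pdf")
pdf(plot_path)
plot(
  data$patient_age, data$length_of_stay,
  xlab = "Patient Age (Years)", ylab = "Length of Stay (Days)"
)
dev.off()
```

Because neither path mentions a specific computer or user folder, the same script should run unchanged for anyone who opens the project.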
To set up an R Project go to File -> New Project
Use the README.md document to briefly describe your project, including what you have done and what the output is.
Version Control
If you have ever had a bunch of files that look something like this then you may want to consider using a version control system to manage your projects
Using a version control system can:
help organise your work and keep track of updates and changes
make it easier to collaborate with others
create a repository that can be shared more widely when a project is complete
be difficult to navigate at first but quickly become integrated into your regular workflow
The most widely used software for version control in the data science community is Git. Git takes a snapshot of all the files in a project at a specific time, referred to as a “commit”. It stores the initial version and any subsequent updated versions that are committed. It tracks the changes you have made at each commit, which can be viewed using the “diff” command.
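As a rough illustration of the commit and diff cycle, the sketch below drives Git from R using system2(). It assumes Git is installed and available on the PATH, and it works in a temporary folder so no real project is touched. The helper run_git(), the file name, and the demo user details are made up for this example. In everyday use you would type the same commands (git init, git add, git commit, git diff) directly in a terminal.

```r
# Assumes Git is installed and on the PATH; everything happens in a temp folder.
repo_dir <- tempfile("git_demo_")  # hypothetical throwaway project
dir.create(repo_dir)

# Hypothetical helper: run a Git command inside repo_dir and return its output.
run_git <- function(...) {
  system2("git", c("-C", shQuote(repo_dir), ...), stdout = TRUE, stderr = TRUE)
}

# Start tracking the folder with Git.
run_git("init")

# First version of a script, recorded as the first commit (snapshot).
script_path <- file.path(repo_dir, "01_data_cleaning.R")
writeLines("data <- read.csv('raw_data.csv')", script_path)
run_git("add", "01_data_cleaning.R")
run_git(
  "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
  "commit", "-m", shQuote("Add data cleaning script")
)

# Change the script, then ask Git what differs from the last commit.
writeLines(
  c("data <- read.csv('raw_data.csv')", "data <- na.omit(data)"),
  script_path
)
diff_output <- run_git("diff")
print(diff_output)
```

The diff output marks added lines with a leading + and removed lines with a leading -, which is how Git reports what has changed since the last commit.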
GitHub is a complementary hosting platform for your repositories (others are available). Once updates have been committed to Git they can be “pushed” to GitHub. Collaborators can then “fork” their own copy of the repository, “clone” it to work on it locally whilst you are also still working on it, and share changes by pushing and pulling commits to and from GitHub.
What a repository looks like on GitHub
I would recommend reading this article, which explains in more detail how to use Git and GitHub.
Git can be integrated into RStudio and therefore more easily incorporated into your workflow. Once Git is installed, an additional tab will appear in the environment pane, where you can commit and push files.
Alternatively, you can go into the RStudio terminal tab and type Git commands from there.
To install Git and connect it to GitHub and RStudio, follow this tutorial by Jenny Bryan. It talks through each setup step and how to run basic Git commands.